
wip: LLMs.txt toolkit local history from pr-7465 worktree #7541

Draft
Mustafa-Esoofally wants to merge 26 commits into main from wip/cleanup-llms-txt-history-20260415

Conversation

@Mustafa-Esoofally
Contributor

Summary

Preserves the granular local history of feat/llms-txt-reader-tools from the pr-7465-llms-txt-fixes worktree. PR #7458 squash-merged this work, so main already has the feature — this branch keeps the 25 original commits for reference (review iteration history, type cleanup, import fixes, etc.).

Also captures an unrelated dirty file that was in the worktree:

  • libs/agno/tests/unit/os/routers/test_sort_order_default.py — cross-contamination from another worktree, triage separately.

Status

Safe to close. PR #7458 already merged the work. This exists only so nothing gets lost during worktree cleanup.

ashpreetbedi and others added 26 commits April 10, 2026 12:44
Add a reader and toolkit for the llms.txt standard (https://llmstxt.org),
enabling agents to discover and consume documentation indexes.

LLMsTxtReader: fetches an llms.txt URL, parses the standardized markdown
format to extract all linked doc URLs, fetches page content (handling HTML,
markdown, plain text), and returns Documents with section/title metadata.
Async variant fetches all pages concurrently.
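The llms.txt format itself is simple markdown, so the parsing step can be sketched independently of agno's APIs (the function and names below are illustrative, not the reader's actual method):

```python
import re

# '- [name](url)' link entries, optionally followed by ': description'
LINK_RE = re.compile(r"-\s*\[([^\]]+)\]\(([^)]+)\)")

def parse_llms_txt(text):
    """Parse an llms.txt index: '# title', '> summary', '## section'
    headers, and '- [name](url)' link entries. Illustrative sketch only."""
    title, summary, sections = None, None, {}
    current_section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("# ") and title is None:
            title = line[2:].strip()
        elif line.startswith("> ") and summary is None:
            summary = line[2:].strip()
        elif line.startswith("## "):
            current_section = line[3:].strip()
            sections[current_section] = []
        elif current_section is not None:
            match = LINK_RE.match(line)
            if match:
                sections[current_section].append((match.group(1), match.group(2)))
    return {"title": title, "summary": summary, "sections": sections}
```

Each linked URL in `sections` is then fetched and turned into a Document carrying its section and title as metadata.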

LLMsTxtTools provides two modes:
- Agentic: get_llms_txt_index returns the index so the agent picks which
  pages to read, then read_llms_txt_url fetches individual pages.
- Knowledge: read_llms_txt_and_load_knowledge bulk-fetches all linked
  pages and inserts them into a Knowledge base.

Includes 32 unit tests and 2 cookbook examples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

## Summary

Addresses code review feedback on #7458. Fixes several issues in the
LLMsTxtReader and LLMsTxtTools implementation.

**Changes:**
- **Lazy BeautifulSoup import** - Deferred to `_extract_content()`
instead of hard-failing at module import time
- **Variable shadowing fix** - Renamed `url` to `entry_url` in
`async_read()` dict comprehension to avoid shadowing the method
parameter
- **Concurrency limiting** - Added `asyncio.Semaphore(10)` to prevent
overwhelming target servers when fetching 100+ URLs concurrently
- **Better text extraction** - Changed `_extract_content()` separator
from `" "` to `"\n"` to preserve document structure
- **Public API methods** - Renamed `_fetch_url` / `_parse_llms_txt` to
`fetch_url` / `parse_llms_txt` since they are called by the toolkit
- **Reader reuse** - LLMsTxtTools now creates a single `LLMsTxtReader`
instance in `__init__` instead of per tool call
- **Async tool variants** - Added `aget_llms_txt_index`,
`aread_llms_txt_url`, `aread_llms_txt_and_load_knowledge` registered via
`async_tools` following the codebase convention (e.g. BrandfetchTools)
- **New tests** - Added tests for async tool registration, reader reuse,
and newline preservation in HTML extraction
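The concurrency cap from the list above can be sketched as a plain semaphore-guarded gather (names are illustrative; agno's implementation may differ):

```python
import asyncio

async def bounded_gather(urls, fetch_one, max_concurrent=10):
    """Fan out async fetches but never run more than max_concurrent at once,
    so 100+ linked pages don't hit the target server in a single burst."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url):
        async with semaphore:
            return await fetch_one(url)

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(guarded(u) for u in urls))
```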

## Type of change

- [x] Improvement

---

## Checklist

- [x] Code complies with style guidelines
- [x] Ran format/validation scripts (`./scripts/format.sh` and
`./scripts/validate.sh`)
- [x] Self-review completed
- [x] Documentation updated (comments, docstrings)
- [x] Tests added/updated (if applicable)

### Duplicate and AI-Generated PR Check

- [x] I have searched existing [open pull requests](../../pulls) and
confirmed that no other PR already addresses this issue
- [x] Check if this PR was entirely AI-generated (by Copilot, Claude
Code, Cursor, etc.)

---

## Additional Notes

All 36 tests pass (up from 32; added 4 new tests for async registration, reader reuse, and HTML newline preservation).

- Full async docstrings on all 3 async tool methods so the LLM sees
  proper tool descriptions in async mode
- AsyncClient now receives timeout and proxy via `_async_client_kwargs()`
- Module-level httpx import consistent with Brandfetch/Perplexity
- Extract `_process_response()` to deduplicate content-type classification
  across `fetch_url` and `async_fetch_url`
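A deduplicated classifier in the spirit of the `_process_response()` mentioned above might look like this (a sketch; the helper's real signature and return shape are not shown in this PR):

```python
def classify_content_type(content_type_header):
    """Map a Content-Type header value to a handler key so the sync and
    async fetch paths can share one classification routine."""
    media_type = content_type_header.split(";")[0].strip().lower()
    if media_type in ("text/html", "application/xhtml+xml"):
        return "html"
    if media_type in ("text/markdown", "text/x-markdown"):
        return "markdown"
    if media_type.startswith("text/"):
        return "text"
    return "unknown"
```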
Instead of manually reading documents and looping insert(), delegate
to self.knowledge.insert(url=url, reader=self.reader), which gives us
content hashing, deduplication, status tracking, and proper vector DB
insertion — matching the pattern used by WebsiteTools and WikipediaTools.

Reader:
- Remove redundant state: in_optional and past_first_section replaced
  by a single current_section variable
- Remove dead if/else branch on proxy — httpx accepts proxy=None
- Remove WHAT comments that restate the next line
- Simplify AsyncClient construction (proxy=self.proxy directly)

Toolkit:
- Extract _format_index helper to deduplicate sync/async index building
- Delegate knowledge loading to the Knowledge.insert(url=, reader=) pipeline

Knowledge:
- Skip pre-download when a custom reader is provided — URL-based readers
  like LLMsTxtReader need the URL string, not pre-fetched BytesIO

The overview document (title + summary from the llms.txt) provides
essential context about the project. No caller ever set this to False.
Removing the parameter and its branch simplifies the reader.

- Remove __init__ docstring (no other reader has one)
- Rewrite parse_llms_txt: replace 3 continue statements with a clean
  if/elif/else chain — each line falls into one bucket
- Remove include_llms_txt_content param (always True, never exposed)

_extract_content was called exactly once. Inlining it removes one
indirection layer — the reader now has only the helpers that are
actually shared between read() and async_read().

The 3-way exception split (HTTPStatusError, RequestError, Exception)
was duplicated between sync and async. For a reader fetching doc pages,
a single catch with a warning log is sufficient. Each method is now
4 lines instead of 12.
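The consolidated error handling reduces each fetch method to roughly this shape (illustrative; `fetch` stands in for the actual HTTP call):

```python
import logging

logger = logging.getLogger(__name__)

def fetch_page(url, fetch):
    """Fetch one doc page; on any failure, log a warning and return None
    so a single bad link does not abort the whole read."""
    try:
        return fetch(url)
    except Exception as exc:
        logger.warning("Failed to fetch %s: %s", url, exc)
        return None
```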
Keep the semaphore (Codex confirms: this is external HTTP fan-out, not
local processing — an unbounded gather would burst 100 requests at once).
Remove the _MAX_CONCURRENT_FETCHES constant; inline the value with a comment
explaining why it exists.

Add timeout and follow_redirects params to the existing fetch_with_retry
and async_fetch_with_retry in utils/http.py. The reader now uses these
shared utils instead of making raw httpx.get calls — retry logic,
error handling, and connection management in one place.

Removed semaphore — httpx AsyncClient already limits concurrent
connections per host (default 20).
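A retry wrapper in the spirit of `fetch_with_retry` could look like the sketch below; the real `utils/http.py` helper differs in detail, and `fetch` plus the backoff constants here are assumptions:

```python
import time

def fetch_with_retry(fetch, url, max_retries=3, base_delay=0.1):
    """Retry a fetch with exponential backoff, re-raising the last error
    if every attempt fails."""
    last_exc = None
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception as exc:
            last_exc = exc
            time.sleep(base_delay * (2 ** attempt))
    raise last_exc
```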
max_urls=100 was too high — it would overwhelm model context in agentic
mode. 20 matches the knowledge cookbook and WebsiteReader's max_links=10
ballpark. timeout=60 matches the global httpx client default.

bs4 import now fails at import time (matching the WebsiteReader and
WebSearchReader pattern) instead of deep inside a fetch call.

LLMsTxtReader import moved to the top of the toolkit — no reason to defer
an internal agno module.

The class docstring was a 30-line essay — most toolkits have none.
The code structure already shows the two modes (with/without knowledge).
Removed the remaining WHAT comment in _build_documents.
- Trim tool docstrings: remove repeated llms.txt explanations, keep
  only what the LLM needs to decide when/how to call the tool
- Replace the _async_client_kwargs dict builder with _async_client(),
  which returns the client directly
- Add section comments to separate helpers / agentic tools / knowledge
  tools for scannable code
- Remove unused Dict import

Docstrings now use the same format as GmailTools and GoogleCalendarTools:
triple-quote, Args (type): description, Returns: type: description.
Replaced section dividers with inline comments matching the Gmail pattern.
Helpers have no docstrings (the underscore prefix signals internal use).

Toolkit: every tool method now wrapped in try/except returning error
strings, matching Gmail/Calendar pattern. Helpers at top, tools below.

Reader: reordered — __init__, classmethods, helpers (_process_response,
_build_documents), then public methods (parse_llms_txt, fetch_url,
read, async_read). Removed bloated docstrings on helpers. Trimmed
class docstring to just the example.
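The Gmail/Calendar error-string convention described above reduces each tool method to this shape (a sketch; `do_read` is a stand-in for the actual reader call):

```python
def read_llms_txt_url_tool(url, do_read):
    """Tools return error strings instead of raising, so a failure is
    surfaced to the LLM as ordinary text it can react to."""
    try:
        return do_read(url)
    except Exception as exc:
        return f"Error reading {url}: {exc}"
```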
tools list uses Callable instead of Any. Removed Any from kwargs
(untyped kwargs is the codebase pattern — other toolkits don't type it).

Restructured from class-based to flat functions with @pytest.fixture,
matching the test_perplexity.py and test_gmail_tools.py patterns.
matching test_perplexity.py and test_gmail_tools.py patterns.

New coverage:
- Async reader: async_read happy path + failure
- Async toolkit: aget_llms_txt_index, aread_llms_txt_url,
  aread_llms_txt_and_load_knowledge
- Error handling: try/except returns error strings
- Edge cases: empty overview, HTML sniffing, unknown content-type
- Shared _mock_httpx_response helper for DRY mock setup

34 tests -> 46 tests

The previous fix (skip pre-download when any custom reader is provided)
broke PDFReader and other file-based readers that need BytesIO. Now we
check if the reader supports ContentType.URL — only URL-based readers
like LLMsTxtReader and WebsiteReader skip the pre-download. File-based
readers (PDFReader, CSVReader, etc.) still get pre-downloaded bytes.
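The reader-capability check might be sketched like this; `ContentType` and the capability-set parameter are stand-ins for the real agno types named in the commit:

```python
from enum import Enum

class ContentType(Enum):
    # stand-in for agno's ContentType enum
    URL = "url"
    FILE = "file"

def needs_predownload(reader_content_types):
    """File-based readers (PDF, CSV, ...) need pre-fetched bytes;
    URL-based readers (LLMsTxt, Website) consume the URL string directly."""
    return ContentType.URL not in reader_content_types
```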
Only forward timeout and follow_redirects to httpx when explicitly
passed by the caller. Previously, default values (timeout=None,
follow_redirects=False) were always forwarded, which removed httpx's
built-in 5s timeout and overrode client-level redirect settings.
follow_redirects and timeout use Optional[None] default so existing
callers see zero behavior change. Build kwargs dict conditionally
instead of type-ignore comments. Import order fixed by format.sh.
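The conditional-forwarding fix amounts to building the kwargs dict only from explicitly passed values (a sketch, decoupled from httpx):

```python
def build_http_kwargs(timeout=None, follow_redirects=None):
    """Only include options the caller actually set, so unset options
    fall through to the HTTP client's own defaults (e.g. httpx's
    built-in 5s timeout and client-level redirect setting)."""
    kwargs = {}
    if timeout is not None:
        kwargs["timeout"] = timeout
    if follow_redirects is not None:
        kwargs["follow_redirects"] = follow_redirects
    return kwargs
```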